metaGWASmanager: a toolbox for an automated workflow from phenotypes to meta-analysis in GWAS consortia

Abstract Summary This article introduces the metaGWASmanager, which streamlines genome-wide association studies within large-scale meta-analysis consortia. It is a toolbox for both the central consortium analysis group and participating studies to generate homogeneous phenotypes, minimize unwanted variability from inconsistent methodologies, ensure high-quality association results, and implement time-efficient quality control workflows. The toolbox features a plug-in-based approach for customization of association testing. Results The metaGWASmanager toolbox has been successfully deployed in both the CKDGen and MetalGWAS Initiative consortia across hundreds of participating studies, demonstrating its effectiveness in GWAS analysis optimization by automating routine tasks and ensuring the value and reliability of association results, thus, ultimately promoting scientific discovery. We provide a simulated data set with examples for script customization so that readers can reproduce the pipeline at their convenience. Availability and implementation GitHub: https://github.com/genepi-freiburg/metaGWASmanager


Introduction
Genome-Wide Association Studies (GWAS) aim to unravel genes implicated in health and disease and advance our understanding of physiology and pathology (Tam et al. 2019).In GWAS, obtaining meaningful results involves increasing sample sizes through meta-analysis, as exemplified by the saturation of height GWAS signals with multi-million individuals (Yengo et al. 2022).Partnership within consortia, as seen in the collaborative efforts of CKDGen over a decade (K€ ottgen and Pattaro 2020), is critical to achieving these sample sizes.Consortia typically establish a centralized analysis group, and study-specific analysts follow a predefined analysis plan.To ensure reliable results, maintaining high-quality, standardized phenotypic data and genotypes across studies and quality control on the contributing study summary statistics are imperative.
Therefore, we aim to offer a versatile framework that can be universally applied to accomplish complex tasks with different tools for phenotype generation and statistical analyses.This intends to minimize between-studies heterogeneity originating from the use of inconsistent methodologies, disparate data preparation, or divergent tools for statistical tests, and to save time by automating routine tasks both for the central Consortium Analysts (CA) group and the Study Analysts (SA).
In this article, we introduce a comprehensive toolbox leveraging existing software packages and streamlining the entire consortia workflow.This includes harmonization and quality control of phenotypes, GWAS, quality control reports of GWAS (in the form of test statistics and plots) and metaanalysis, providing an integrated solution for study sites and central analysis teams (Fig. 1).

Methods and workflow
For the presented tools, we build on R, Bash and Python as a foundation universally available across study sites.While CAs benefit from some familiarity with these languages to customize the toolbox to the consortium, SAs do not need specific coding skills.To fulfill the unique needs of each consortium, we adopted a plug-in concept, allowing for the flexible customization of the analysis pipeline with respect to phenotype preparation and used association software tools.
Our toolbox addresses two groups of users.First, the CA group is responsible for the analysis plan, pre-and post- GWAS quality control design, meta-analysis, and other downstream analyses.CAs do not have access to individuallevel participant data.Second, the SA performs the phenotype preparation and association analyses at the study level and upload their results to the consortium server.
The workflow comprises five phases and is presented in Fig. 1.In the first phase, starting with an analysis plan, the CAs customize the metaGWASmanager scripts according to the specific requirements of the consortium.The CAs can easily adopt the consortium-specifics.R plug-in file to meet diverse requirements aligned with the accompanying analysis plan.This enables customization of the entire analysis process, ranging from SA phenotype preparation over the actual GWAS to the centralized summary statistics quality control and meta-analysis by the CA group.MetaGWASmanager currently implements plink 2.0 (Chang et al. 2015) and regenie (Mbatchou et al. 2021) association softwares and can easily be customized for other tools.The analysis plan and the customized toolbox are distributed to the participating SAs.
In the second phase, the SAs prepare a study-specific phenotype file, using a programming language of their choice, and run the toolbox to check and summarize it.The script conducts a thorough examination of the input file, issuing warnings and detecting errors to ensure high-quality phenotypic data and avoid common pitfalls like wrong unit conversions or variations in assay methods.Furthermore, the pipeline performs necessary trait transformations, stratification, and calculations, standardizing the phenotypic data for consistency.SAs then upload the generated phenotype summary statistics (not containing individual-level data) and diagnostic plots (Supplementary Data), including descriptive graphs of trait-specific variables to the CAs.
In the third phase, CAs inspect phenotype distributions, detect outliers, and validate across studies (Fig. 1).After phenotype approval, in phase 4, studies proceed with the preparation of their genotypic data in a compatible format, with the help of provided scripts, and execute the GWAS.The toolbox provides job scripts and includes GWAS tool command line options to ensure best practice analysis.Results are packaged in a standardized way to facilitate CA group analyses.These include marker association statistics, phenotype summaries of the phenotypes actually used for association after sample exclusions, and diagnostic log files.
In phase 5, to maintain the highest possible data quality, the CAs employ code to properly format and verify GWAS statistics.GWAS files are examined using the GWASinspector R package (Ani et al. 2021).A wide variety of diagnostic reports are generated to facilitate the check for several potential issues, including discrepancies in genomic builds, allele switches and swaps, missing data, file and number formatting issues, unaccounted inflation, and incorrect transformations.The CA group communicates any identified issue back to the respective SA for resolution.
Lastly, our pipeline facilitates progress tracking and reporting by generating tables and plots that allow for efficient monitoring of the consortium's progress, such as ongoing GWAS status, studies awaiting analysis, and current sample sizes.These tracking mechanisms are crucial for effective coordination and management of the consortium effort.Once data collection and QC is complete, the actual GWAS meta-analysis is conducted using METAL (Willer et al. 2010).

Applications in the CKDGen and metals consortia
The presented toolbox has been developed in the context of the CKDGen consortium (Teumer et al. 2019, Tin et al. 2019, 2021, Wuttke et al. 2019, Schlosser et al. 2021) and has recently been successfully applied in an ongoing largescale GWAS meta-analyses with over 130 participating studies of >2 million individuals with 18 endpoints and a total of >1,800 unique GWAS.We show exemplary phenotype and cleaning diagnostic plots and reports along with comprehensive documentation in the Supplementary data.
To prove the transferability of the workflow, we implemented the same tools with the recently established MetalGWAS Initiative studying >40 phenotypes in �10 studies (https://biodama.isciii.es/metal-gwas/).
Furthermore, we provide sample scripts to run the complete process using simulated data.

Conclusion
The metaGWASmanager aids to identify an increased number of associated variants by reducing between-study heterogeneity from inconsistent methodologies in phenotype preparation and divergent tools in association analyses.By establishing output file consistency, it supports the metaanalyses processes.Its technical foundation in R, Bash and Python, along with the flexible plug-in concept and comprehensive documentation, facilitates quick implementations in a wide range of settings.The provided simulated data set with examples of script customization will allow readers to easily reproduce and familiarize themselves with the pipeline.
The metaGWASmanager tool significantly streamlines analytical workflows by automating routine tasks, ultimately saving valuable time of the GWAS consortium team and making the scientific output more robust, thus facilitating biological discovery.

Figure 1 .
Figure1.Workflow.Illustration of the metaGWASmanager pipeline, outlining phenotype generation and quality assurance, followed by the GWAS, GWAS-QC and meta-analysis steps, along with the required inputs and resulting outputs.Shaded and white sections indicate tasks to be carried out by CAs and SAs, respectively.